[SPARK-18025] Use commit protocol API in structured streaming #15710
rxin wants to merge 8 commits into apache:master
    outputPath.toString,
    isAppend)

    WriteOutput.write(
I'm thinking I should just rename WriteOutput to FileFormatOutput
Test builds #3387, #67873, #67874, #67877, #67913, and #67912 have finished for PR 15710.
    }

    override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
      if (addedFiles.nonEmpty) {
Is this just an optimization to avoid instantiating the fs for empty writes?
I copied the same logic from before, but I think so.
Actually, the other thing is that we are using the head. Technically we could use headOption and then map over it, but that would be pretty weird.
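For readers following this thread, here is a minimal sketch of the two styles being compared. It is not the code under review: `HeadOptionSketch`, `schemeGuarded`, and `schemeViaOption` are hypothetical names, and extracting a URI scheme from the first path is only an assumed stand-in for whatever the real method does with `addedFiles.head`.

```scala
// Hypothetical illustration only; not taken from the patch.
object HeadOptionSketch {
  // Style 1: guard with nonEmpty, then call head (safe only because of the guard).
  def schemeGuarded(addedFiles: Seq[String]): Option[String] =
    if (addedFiles.nonEmpty) {
      Some(addedFiles.head.takeWhile(_ != ':'))
    } else {
      None
    }

  // Style 2: headOption plus map; the empty case falls out as None without a guard.
  def schemeViaOption(addedFiles: Seq[String]): Option[String] =
    addedFiles.headOption.map(_.takeWhile(_ != ':'))
}
```

Both variants return the same result for empty and non-empty input. The guarded form makes the "skip work for empty writes" intent explicit, which may be why it was kept.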
      }
    }

    import org.apache.spark.sql.execution.datasources.OutputWriter
It's not. This is the top.
    import testImplicits._

    test("FileStreamSinkWriter - unpartitioned data") {
These tests covered code that has been deleted completely, so they are now purely redundant with all the tests we have for the batch write path.
LGTM

Thanks, merging to master.
## What changes were proposed in this pull request?

This patch adds a new commit protocol implementation ManifestFileCommitProtocol that follows the existing streaming flow, and uses it in FileStreamSink to consolidate the write path in structured streaming with the batch mode write path.

This deletes a lot of code, and would make it trivial to support other functionalities that are currently available in batch but not in streaming, including all file formats and bucketing.

## How was this patch tested?

Should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15710 from rxin/SPARK-18025.
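To make the idea of a commit protocol more concrete, the following is a rough, self-contained sketch of a two-phase flow: tasks report the files they wrote, and the job commit records them in a manifest. It is not Spark's API and not the implementation in this patch. `TaskMessage`, `ManifestCommitSketch`, and the in-memory manifest are hypothetical, and the assumption that the manifest is only updated at job commit is an illustration, not something stated in the description above.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified model; these are not Spark's classes or signatures.
final case class TaskMessage(addedFiles: Seq[String])

class ManifestCommitSketch {
  // Stand-in for a durable manifest; kept in memory for illustration only.
  private val manifest = ArrayBuffer.empty[String]

  // Task side: a task hands back the list of files it wrote.
  def commitTask(addedFiles: Seq[String]): TaskMessage =
    TaskMessage(addedFiles)

  // Driver side: files are recorded only when the whole job commits.
  def commitJob(messages: Seq[TaskMessage]): Unit =
    manifest ++= messages.flatMap(_.addedFiles)

  // Files a reader of the manifest would see.
  def committedFiles: Seq[String] = manifest.toSeq
}
```

In this sketch, calling `commitTask` alone leaves `committedFiles` empty; the files appear only after `commitJob` receives the task messages. Sharing one such interface between batch and streaming is the kind of consolidation the description refers to.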